Using place name data to train language identification models
Authors
Abstract
The language of origin of a name affects its pronunciation, so language identification is an important technology for speech synthesis and recognition. Previous work on this task has typically used training sets that are proprietary or limited in coverage. In this work, we investigate the use of a publicly available geographic database for training language ID models. We automatically cluster place names by language, and show that models trained on place name data are effective for language ID on person names. In addition, we compare several source-channel and direct models for language ID, and achieve a 24% reduction in error rate over a source-channel letter trigram model on a 26-way language ID task.
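As a rough illustration of the baseline named in the abstract, a source-channel letter trigram classifier picks the language that maximizes P(language) × P(name | language), where the second factor is a product of letter trigram probabilities. The sketch below is not the authors' implementation: the class name `TrigramLanguageID`, the add-alpha smoothing, the `#` padding symbol, and the six-name toy training set are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

PAD = "#"  # assumed boundary symbol, not taken from the paper


def trigrams(name):
    """Return the letter trigrams of a name, padded at both ends."""
    s = PAD * 2 + name.lower() + PAD
    return [s[i:i + 3] for i in range(len(s) - 2)]


class TrigramLanguageID:
    """Source-channel classifier: argmax over P(lang) * P(name | lang)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                 # add-alpha smoothing (an assumption)
        self.tri = defaultdict(Counter)    # lang -> trigram counts
        self.ctx = defaultdict(Counter)    # lang -> two-letter context counts
        self.prior = Counter()             # lang -> number of training names
        self.alphabet = set()

    def fit(self, examples):
        for name, lang in examples:
            self.prior[lang] += 1
            for g in trigrams(name):
                self.tri[lang][g] += 1
                self.ctx[lang][g[:2]] += 1
                self.alphabet.update(g)
        return self

    def log_score(self, name, lang):
        vocab = len(self.alphabet) + 1     # one extra slot for unseen letters
        total = sum(self.prior.values())
        score = math.log(self.prior[lang] / total)          # log P(lang)
        for g in trigrams(name):                             # log P(name | lang)
            num = self.tri[lang][g] + self.alpha
            den = self.ctx[lang][g[:2]] + self.alpha * vocab
            score += math.log(num / den)
        return score

    def predict(self, name):
        return max(self.prior, key=lambda lang: self.log_score(name, lang))


# Hypothetical toy data; the paper trains on place names from a geographic database.
training = [
    ("Kowalski", "pl"), ("Nowakowski", "pl"), ("Wisniewski", "pl"),
    ("Rossi", "it"), ("Bianchi", "it"), ("Ricci", "it"),
]
model = TrigramLanguageID().fit(training)
print(model.predict("Lewandowski"))
print(model.predict("Ferrucci"))
```

With a realistic training set, the same structure would take (place name, language) pairs in `fit` and be evaluated on person names, which is the transfer setting the abstract describes.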
Similar resources
Language Identification of Bengali-English Code-Mixed Data Using Character & Phonetic Based LSTM Models
Language identification of social media text still remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for language identification at the word level of low-resource Bengali-English code-mixed data taken from social media. We employ two methods of word encoding, namely character based an...
Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data
This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically-based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train ...
Phonetic Landmark Detection for Automatic Language Identification
This paper presents a method of augmenting shifted-delta cepstral coefficients (SDCCs) with the classification outputs of an array of support vector machines (SVMs) trained to detect a set of manner and place features on telephone speech. The SVM array allows for broad phoneme classification, and when this information is concatenated with SDCCs to form a hybrid feature vector for each acoustic ...
Open-Set Language Identification
We present the first open-set language identification experiments using one-class classification models. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus f...
G2P Conversion of Proper Names Using Word Origin Information
Motivated by the fact that the pronunciation of a name may be influenced by its language of origin, we present methods to improve pronunciation prediction of proper names using word origin information. We train grapheme-to-phoneme (G2P) models on language-specific data sets and interpolate the outputs. We perform experiments on US surnames, a data set where word origin variation occurs naturall...